Bayesian Modeling of Panel Data with R Package dynamite

useR! 2024

Santtu Tikka (joint work with Jouni Helske)

University of Jyväskylä

Panel Data

  • Data from multiple individuals \(i = 1,\ldots,N\)
    • Persons, countries, firms, …
  • Measured over multiple time points \(t = 1,\ldots,T\).
    • Can be calendar time, age, or some other variable defining the order of the observations.
    • We focus on cases with relatively large \(T\) (e.g., \(T = 100\)), sometimes referred to as intensive longitudinal data.
  • Multiple measurements per individual \(c = 1,\ldots,C\)
    • Person’s yearly income, health status, and partnership status, place of residence, date of birth, …
  • Often used for observational causal inference.

Statistical Methods

  • Several modelling approaches, depending on the observed and assumed characteristics of the data and the underlying causal graph.
  • Often based on some variant of cross-lagged panel model (CLPM).
    • e.g., Allison et al. 2017, Hamaker et al. 2015, Asparouhov et al. 2018
  • Estimation is typically based on structural equation models (SEM) using maximum likelihood.
  • Models are sometimes formulated on a latent level instead of the observational level
    • This can make causal interpretations challenging

Dynamic Multivariate Panel Models

The dynamic multivariate panel model (DMPM), implemented in R package dynamite, supports

  • Multiple response variables with various distributions, e.g., Gaussian, Poisson, binomial, categorical, …
  • Time-varying and time-invariant effects
    • Time-varying effects are defined using Bayesian P-splines
  • Subject-specific random effects
  • Latent factors
  • Fully Bayesian estimation via Stan
    • Also provides us with posterior distributions of causal effects
    • Both rstan and cmdstanr are supported
  • Models are defined on the observational level instead of the latent level

Employment Trajectories

As an illustration, we study a subset of the data from Swiss Household Panel (Tillmann et al. 2016), obtained from the R package march (Maitre et al. 2020)

  • The data consist of the employment trajectories of 845 individuals (421 women and 424 men).
  • The employment response variable fulltime was observed every two years from age of 20 to the age of 44 (i.e., 13 time points)
    • coded as being either 1 (full-time employee) or 0 (other).
  • We assume that the employment status depends on the full work history,
    • simplified here to an indicator for if the person has ever worked full time or not (ever)
  • For simplicity, we consider only one time-invariant covariate:
    • gender coded as 0 = woman, 1 = man (time-invariant)

Preparing the Data

First, we prepare the data for analysis

library("dplyr")

library("tibble")

library("march")

library("dynamite")



N <- Employment.2@N

d <- data.frame(

  employment = factor(c(Employment.2@yRaw), labels = c("Full-time", "Other")),

  gender = factor(Employment.2@cov[,,1], labels = c("Woman", "Man")),

  id = 1:N, 

  age = rep(seq(20, 44, by = 2), each = N)

) |>

  mutate(fulltime = as.integer(employment == "Full-time"))

Model Structure

The graph above depicts our assumed model structure (causal diagram)

  • \(y_{t}\) is the employment status at time \(t\) (data variable fulltime)
  • \(h_{t}\) is the work history status at time \(t\) (data variable ever)
  • \(g\) denotes the gender (data variable gender)

Defining the Model

model_formula <- 

  obs(

    fulltime ~ -1 + gender:lag(ever) + varying(~ -1 + gender + gender:lag(fulltime)),

    family = "bernoulli"

  ) +

  aux(numeric(ever) ~ fulltime == 1 | lag(ever) == 1 | init(0)) +

  splines(df = 6, noncentered = TRUE)
  • dynamite uses (mostly) standard R formula syntax for defining models
  • obs() defines a stochastic response variable
  • varying() defines the time-varying part of the model formula
  • lag() creates a first-order lagged variable (higher-order lags are supported)
  • aux() defines a determinisic response variable
    • similar to the I() function, but also works for predict() dynamically
  • splines() defines the parametrization of the splines for time-varying effects
  • Definitions of individual model components are combined via +

Customizing the Priors

priors <- get_priors(model_formula, data = d, group = "id", time = "age")

priors$prior[priors$type == "tau"] <- "normal(0, 1)"

as_tibble(priors)
# A tibble: 10 × 5

   parameter                                response prior        type  category

   <chr>                                    <chr>    <chr>        <chr> <chr>   

 1 beta_fulltime_genderWoman:ever_lag1      fulltime normal(0, 2) beta  ""      

 2 beta_fulltime_genderMan:ever_lag1        fulltime normal(0, 2) beta  ""      

 3 delta_fulltime_genderWoman               fulltime normal(0, 2) delta ""      

 4 delta_fulltime_genderMan                 fulltime normal(0, 2) delta ""      

 5 delta_fulltime_genderWoman:fulltime_lag1 fulltime normal(0, 2) delta ""      

 6 delta_fulltime_genderMan:fulltime_lag1   fulltime normal(0, 2) delta ""      

 7 tau_fulltime_genderWoman                 fulltime normal(0, 1) tau   ""      

 8 tau_fulltime_genderMan                   fulltime normal(0, 1) tau   ""      

 9 tau_fulltime_genderWoman:fulltime_lag1   fulltime normal(0, 1) tau   ""      

10 tau_fulltime_genderMan:fulltime_lag1     fulltime normal(0, 1) tau   ""      
  • dynamite tries to select suitable priors for the model parameters
  • get_priors() gives the default priors as a data frame
  • Priors can be customized by replacing the priors we want by entering a suitable distribution in the prior column
    • The prior is given in Stan syntax and should be supported by Stan

Fitting the Model

fit <- dynamite(

  model_formula, data = d, group = "id", time = "age", priors = priors,

  chains = 4, cores = 4, iter = 5000, refresh = 0

)
  • We fit the DMPM with dynamite()
  • group and time define the columns of data that identify individuals and time points, respectively
  • priors (optional) defines the priors (if not default)
  • chains, cores, iter and refresh are arguments to Stan (default backend is rstan)
    • We compute 4 parallel chains with 5000 posterior draws each

Summarizing the Model Fit

The default print() method provides a summary of the model fit including some diagnostics

print(fit)
Model:

         Family       

fulltime bernoulli    

ever     deterministic

         Formula                                                                        

fulltime fulltime ~ -1 + gender:lag(ever) + varying(~-1 + gender + gender:lag(fulltime))

ever     numeric(ever) ~ fulltime == 1 | lag(ever) == 1 | init(0)                       



Data: d (Number of observations: 10140)

Grouping variable: id (Number of groups: 845)

Time index variable: age (Number of time points: 13)



NUTS sampler diagnostics:



No divergences, saturated max treedepths or low E-BFMIs.

Summarizing the Model Fit

Smallest bulk-ESS: 6898 (tau_fulltime_genderWoman)

Smallest tail-ESS: 5691 (tau_fulltime_genderWoman:fulltime_lag1)

Largest Rhat: 1.001 (beta_fulltime_genderWoman:ever_lag1)



Elapsed time (seconds) for fastest and slowest chains:

         warmup  sample

chain:2 131.777 161.224

chain:1 130.105 166.147



Summary statistics of the time- and group-invariant parameters:

# A tibble: 6 × 10

  variable          mean median    sd   mad     q5   q95  rhat ess_bulk ess_tail

  <chr>            <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>

1 beta_fulltime_… 0.575  0.577  0.221 0.222  0.204 0.937  1.00   12463.    8182.

2 beta_fulltime_… 0.0677 0.0684 0.230 0.230 -0.308 0.448  1.00   11623.    7958.

3 tau_fulltime_g… 1.13   1.06   0.388 0.356  0.619 1.86   1.00    6954.    7094.

Other functions to query the model fit include:

  • summary(): provides summary statistics of all model parameters
  • coef(): provides summary statistics of regression coefficients (time-invariant and time-varying)
  • mcmc_diagnostics(): computes effective sample sizes and convergence diagnostics

Visualizing the Model Fit

The default plot() method visualizes the posterior distributions of the model parameters

plot(fit)

Causal Inference with DMPMs

  • Our approach to causal inference is based on structural causal models (SCM, Pearl, 2009)
    • Causal effects are defined through interventions; external changes to the data-generating process
    • Notation: \(P(Y = y|do(X = x))\) is the causal effect of \(X\) on \(Y\)
  • We are interested on how an intervention on employment at the age of 30 affects the future employment status.
    • We compare two interventions: forcing everyone to work full time vs. forcing everyone to not work full time at the age of 30.
    • i.e., \(P(y_{t} = 1|do(y_{30} = 1))\) and \(P(y_{t} = 1|do(y_{30} = 0))\)
  • Assuming that the model structure of the DMPM depicts the true data-generating process, we can obtain the posterior distribution of the causal effect (interventional distribution)
    • We can then assess the magnitude and uncertainty of these effects

Employment Interventions

  • We can use the backdoor adjustment to compute the causal effect of employment at age \(t\) on employment at any future age \(t+k\)
  • The effect is identified as: \[ \begin{aligned} P(y_{t+k}|do(y_{t})) &= \sum_{h_{t},g} P(y_{t+k}|y_t,h_t,g)P(h_t,g) \\ &= \sum_{y_{t+1},\ldots,y_{t+k-1},h_t,g} \prod_{j = 1}^{k} P(y_{t+j},h_{t+j}|y_{t+j-1},h_{t+j-1},g)P(h_t, g) \end{aligned} \]
  • What we need:
    1. \(P(y_{t+j},h_{t+j}|y_{t+j-1},h_{t+j-1},g)\): Our model fit is exactly this!
    2. \(P(h_t,g)\): We can use the empirical distribution of the data instead of explicit modeling

Implementing the Employment Interventions

  • We fix the employment status (fulltime) for everyone at age 30 while keeping the values of other variables the same as the data
  • We set the future values to NA as we will predict these with dynamite to obtain the counterfactual values
  • Finally, we predict starting from age 30 to obtain the causal effects at each age
# No full time employment at age 30

newdata0 <- d |> filter(age >= 28)

newdata0$fulltime[newdata0$age == 30] <- 0

newdata0$fulltime[newdata0$age > 30] <- NA



# Full time employment at age 30

newdata1 <- d |> filter(age >= 28)

newdata1$fulltime[newdata1$age == 30] <- 1

newdata1$fulltime[newdata1$age > 30] <- NA

Employment Interventions

We compute the predictions using predict() and contrast the two interventions

pred <- 

  bind_rows(

    no = predict(

      fit, newdata = newdata0,

      funs = list(fulltime = list(mean = mean))

    )$simulated,

    yes = predict(

      fit, newdata = newdata1,

      funs = list(fulltime = list(mean = mean))

    )$simulated,

    .id = "fulltime_30"

  ) |> 

  filter(age > 28) |> 

  reframe(

    difference = mean_fulltime[fulltime_30 == "yes"] - mean_fulltime[fulltime_30 == "no"],

    .by = "age"

  ) |>

  group_by(age) |>

  summarise(

    mean = mean(difference),

    q2.5 = quantile(difference, 0.025),

    q97.5 = quantile(difference, 0.975),

  )

The funs argument to predict() specifies functions to apply over the predicted values.

Employment Interventions

For contrast, we compute the same for the observed data, i.e., the differences in full time employment at each time point

obs_sumr <- d |> 

  filter(age > 28) |> 

  group_by(id) |>

  mutate(fulltime_30 = ifelse(fulltime[age == 30], "yes", "no")) |>

  group_by(age, fulltime_30) |>

  summarise(mean_fulltime = mean(fulltime)) |>

  group_by(age) |>

  summarise(

    mean = mean(mean_fulltime[fulltime_30 == "yes"] - mean_fulltime[fulltime_30 == "no"]),

    .groups = "keep"

  )

comb <- bind_rows(

  intervention = pred,

  observation = obs_sumr, 

  .id = "Type"

) 

Employment Interventions

Finally, we plot the results

Some Other Features of dynamite

  • Subject-specific random effects
    • Can be added to model formulas of individual response variables with random() (similar to varying())
    • The model component random_spec() defines the parametrization and whether the effects should be correlated
  • Latent factors
    • A common latent factor can be added via the model component lfactor()
    • Can be used to model e.g., how individuals respond differently to some unobserved, common process
  • Customized model code
    • The Stan code of the model can be obtained via get_code() to be customized for specialized user-specific scenarios
    • No need to start from scratch!

Some Other Features of dynamite

  • Within-chain (and between-chain) parallelization support
  • Data-generation from DMPMs
  • Multiple imputation

Further Information

  • The dynamite package is part of rOpenSci
  • Actively maintained and developed at GitHub: https://github.com/ropensci/dynamite
  • The package is available from CRAN, GitHub and R-universe
  • For details on DMPMs, see (Helske and Tikka, 2024)

References

  • Tikka S, Helske J (2024). dynamite: An R Package for Dynamic Multivariate Panel Models. https://arxiv.org/abs/2302.01607
  • Helske J, Tikka S (2024). Estimating Causal Effects from Panel Data with Dynamic Multivariate Panel Models. Advances in Life Course Research, 60, 100617
  • Allison PD, Williams R, Moral-Benito E (2017). Maximum Likelihood for Cross-Lagged Panel Models with Fixed Effects. Socius, 3.
  • Asparouhov T, Hamaker EL, Muthén B (2018). Dynamic Structural Equation Models. Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 359–388.
  • Hamaker EL, Kuiper RM, Grasman RP (2015). A Critique of the Cross-Lagged Panel Model. Psychological methods, 20(1), 102.
  • Maitre O, Emery K, Buschor O, Berchtold A (2020). march: Markov Chains. R package version 3.3.2.
  • Pearl J (2009). Causality: Models, Reasoning and Inference. 2nd Edition. Cambridge University Press